GPU acceleration
RFX: High-Performance Random Forests with GPU Acceleration and QLORA Compression
RFX (Random Forests X), where X stands for compression or quantization, is a production-ready Python implementation of Breiman and Cutler's Random Forest classification methodology. RFX v1.0 provides complete classification: out-of-bag error estimation, overall and local importance measures, proximity matrices with QLORA compression, case-wise analysis, and interactive visualization (rfviz), all with CPU and GPU acceleration. Regression, unsupervised learning, CLIQUE importance, and RF-GAP proximity are planned for v2.0. This work introduces four solutions to the proximity matrix memory bottleneck, which limits Random Forest analysis to ~60,000 samples: (1) QLORA (Quantized Low-Rank Adaptation) compression for GPU proximity matrices, reducing memory from 80GB to 6.4MB for 100k samples (12,500x compression with INT8 quantization) while maintaining 99% geometric structure preservation, (2) CPU TriBlock proximity, which combines upper-triangle storage with block-sparse thresholding and achieves 2.7x memory reduction with lossless quality, (3) SM-aware GPU batch sizing achieving 95% GPU utilization, and (4) GPU-accelerated 3D MDS visualization that computes embeddings directly from low-rank factors using power iteration. Validation across four implementation modes (GPU/CPU x case-wise/non-case-wise) demonstrates correct implementation. GPU achieves 1.4x speedup over CPU for overall importance with 500+ trees. Proximity computation scales from 1,000 to 200,000+ samples (requiring GPU QLORA), with CPU TriBlock filling the gap for medium-scale datasets (10K-50K samples). RFX v1.0 eliminates the proximity memory bottleneck, enabling proximity-based Random Forest analysis on datasets orders of magnitude larger than previously feasible. RFX is open source and provides production-ready classification following Breiman and Cutler's original methodology.
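The memory figures quoted in the RFX abstract can be reproduced with simple arithmetic. The sketch below is only a plausibility check under stated assumptions: the dense proximity matrix is taken to be n x n in 8-byte floats, and the compressed form is taken to be two n x r INT8 factors. The rank r = 32 and the two-factor layout are assumptions chosen because they reproduce the 6.4MB figure; the abstract does not state either.

```python
def dense_proximity_bytes(n_samples: int, bytes_per_entry: int = 8) -> int:
    """Bytes for a full n x n proximity matrix (assumes float64 entries)."""
    return n_samples * n_samples * bytes_per_entry


def low_rank_int8_bytes(n_samples: int, rank: int, n_factors: int = 2) -> int:
    """Bytes for n_factors matrices of shape (n, rank) at 1 byte per INT8 entry.

    The factor count and rank are hypothetical; they are not taken from RFX.
    """
    return n_factors * n_samples * rank


n = 100_000
dense = dense_proximity_bytes(n)              # 80 GB in decimal units
compressed = low_rank_int8_bytes(n, rank=32)  # 6.4 MB under the assumed layout
ratio = dense / compressed

print(f"dense: {dense / 1e9:.1f} GB")
print(f"compressed: {compressed / 1e6:.1f} MB")
print(f"ratio: {ratio:,.0f}x")
```

Under these assumptions the dense matrix grows quadratically with the number of samples while the factors grow linearly, which is why the ratio widens as datasets get larger.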
GPyTorch: Blackbox Matrix-Matrix Gaussian Process Inference with GPU Acceleration
Despite advances in scalable models, the inference tools used for Gaussian processes (GPs) have yet to fully capitalize on developments in computing hardware. We present an efficient and general approach to GP inference based on Blackbox Matrix-Matrix multiplication (BBMM). BBMM inference uses a modified batched version of the conjugate gradients algorithm to derive all terms for training and inference in a single call. BBMM reduces the asymptotic complexity of exact GP inference from O(n^3) to O(n^2). Adapting this algorithm to scalable approximations and complex GP models simply requires a routine for efficient matrix-matrix multiplication with the kernel and its derivative. In addition, BBMM uses a specialized preconditioner to substantially speed up convergence. In experiments we show that BBMM effectively uses GPU hardware to dramatically accelerate both exact GP inference and scalable approximations. Additionally, we provide GPyTorch, a software platform for scalable GP inference via BBMM, built on PyTorch.
- Information Technology > Modeling & Simulation (0.67)
- Information Technology > Hardware (0.67)
- Information Technology > Graphics (0.67)
- Information Technology > Artificial Intelligence > Machine Learning (0.41)
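The core idea in the GPyTorch abstract above, solving linear systems with the kernel matrix using only matrix-matrix products, can be illustrated with a plain batched conjugate gradients loop. This is a minimal NumPy sketch, not the paper's modified algorithm: it omits the preconditioner and does not derive the additional terms needed for training. The function name `batched_cg` and the toy RBF kernel are illustrative assumptions.

```python
import numpy as np


def batched_cg(matmul, B, tol=1e-10, max_iter=1000):
    """Solve K X = B for every column of B at once.

    `matmul` is any routine computing K @ V for a symmetric positive
    definite K; K itself is never accessed directly.
    """
    B = np.asarray(B, dtype=float)
    X = np.zeros_like(B)
    R = B.copy()                      # residuals for the zero initial guess
    P = R.copy()                      # search directions, one per column
    rs_old = np.sum(R * R, axis=0)    # squared residual norm per column
    for _ in range(max_iter):
        if np.sqrt(rs_old.max()) < tol:
            break
        KP = matmul(P)                # the only access to the kernel
        denom = np.sum(P * KP, axis=0)
        # Guarded division so already-converged columns take a zero step.
        alpha = np.divide(rs_old, denom, out=np.zeros_like(rs_old), where=denom > 0)
        X = X + P * alpha
        R = R - KP * alpha
        rs_new = np.sum(R * R, axis=0)
        beta = np.divide(rs_new, rs_old, out=np.zeros_like(rs_new), where=rs_old > 0)
        P = R + P * beta
        rs_old = rs_new
    return X


# Toy example: RBF kernel plus noise on 50 one-dimensional inputs.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(50, 1))
K = np.exp(-0.5 * (x - x.T) ** 2) + 0.1 * np.eye(50)
B = rng.standard_normal((50, 3))
X = batched_cg(lambda V: K @ V, B)
print(np.abs(K @ X - B).max())
```

Each iteration costs one matrix-matrix product, which is the operation GPUs execute efficiently; that is the design point the abstract emphasizes.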
Characterizing and Understanding Energy Footprint and Efficiency of Small Language Model on Edges
Islam, Md Romyull, Deng, Bobin, Dhar, Nobel, Nguyen, Tu N., He, Selena, Shi, Yong, Suo, Kun
Cloud-based large language models (LLMs) and their variants have significantly influenced real-world applications. Deploying smaller models (i.e., small language models (SLMs)) on edge devices offers additional advantages, such as reduced latency and independence from network connectivity. However, edge devices' limited computing resources and constrained energy budgets challenge efficient deployment. This study evaluates the power efficiency of five representative SLMs (Llama 3.2, Phi-3 Mini, TinyLlama, and Gemma 2) on Raspberry Pi 5, Jetson Nano, and Jetson Orin Nano (CPU and GPU configurations). Results show that Jetson Orin Nano with GPU acceleration achieves the highest energy-to-performance ratio, significantly outperforming CPU-based setups. Llama 3.2 provides the best balance of accuracy and power efficiency, while TinyLlama is well-suited for low-power environments at the cost of reduced accuracy. In contrast, Phi-3 Mini consumes the most energy despite its high accuracy. In addition, GPU acceleration, memory bandwidth, and model architecture are key to optimizing inference energy efficiency. Our empirical analysis offers practical insights for AI, smart systems, and mobile ad-hoc platforms on balancing tradeoffs among accuracy, inference latency, and power efficiency in energy-constrained environments.
- Energy (1.00)
- Information Technology > Services (0.48)
- Information Technology > Hardware (0.37)
torchsom: The Reference PyTorch Library for Self-Organizing Maps
Berthier, Louis, Shokry, Ahmed, Moreaud, Maxime, Ramelet, Guillaume, Moulines, Eric
This paper introduces torchsom, an open-source Python library that provides a reference implementation of the Self-Organizing Map (SOM) in PyTorch. This package offers three main features: (i) dimensionality reduction, (ii) clustering, and (iii) user-friendly data visualization. It relies on a PyTorch backend, enabling (i) fast and efficient training of SOMs through GPU acceleration, and (ii) easy and scalable integration with the PyTorch ecosystem. Moreover, torchsom follows the scikit-learn API for ease of use and extensibility.
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- Europe > France > Auvergne-Rhône-Alpes > Puy-de-Dôme > Clermont-Ferrand (0.04)
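For readers unfamiliar with the Self-Organizing Map that torchsom implements, the training rule can be sketched in a few lines. This is a generic online SOM written in NumPy rather than PyTorch, and it is not torchsom's API: the function names, the linear decay schedules, and all default parameters are assumptions made for illustration.

```python
import numpy as np


def train_som(data, grid_shape=(5, 5), n_epochs=20, lr0=0.5, sigma0=2.0, seed=0):
    """Train a rectangular SOM with a Gaussian neighborhood (generic sketch)."""
    rng = np.random.default_rng(seed)
    rows, cols = grid_shape
    # Grid coordinates of every neuron, shape (rows * cols, 2).
    grid = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")
    coords = np.stack(grid, axis=-1).reshape(-1, 2).astype(float)
    # Initialize weights uniformly inside the bounding box of the data.
    weights = rng.uniform(data.min(axis=0), data.max(axis=0),
                          size=(rows * cols, data.shape[1]))
    n_steps = n_epochs * len(data)
    step = 0
    for _ in range(n_epochs):
        for x in data[rng.permutation(len(data))]:
            frac = step / n_steps
            lr = lr0 * (1.0 - frac)                  # linearly decaying learning rate
            sigma = sigma0 * (1.0 - frac) + 1e-3     # shrinking neighborhood radius
            # Best matching unit: the neuron whose weight is closest to x.
            bmu = np.argmin(np.sum((weights - x) ** 2, axis=1))
            grid_dist2 = np.sum((coords - coords[bmu]) ** 2, axis=1)
            h = np.exp(-grid_dist2 / (2.0 * sigma ** 2))
            # Pull the BMU and its grid neighbors toward the sample.
            weights += lr * h[:, None] * (x - weights)
            step += 1
    return weights.reshape(rows, cols, -1)


def quantization_error(data, weights):
    """Mean distance from each sample to its closest neuron."""
    flat = weights.reshape(-1, weights.shape[-1])
    dists = np.linalg.norm(data[:, None, :] - flat[None, :, :], axis=2)
    return float(dists.min(axis=1).mean())


# Toy data: two well-separated 2D blobs.
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0.0, 0.3, size=(100, 2)),
                  rng.normal(5.0, 0.3, size=(100, 2))])
weights = train_som(data)
qe = quantization_error(data, weights)
print(weights.shape, qe)
```

The per-sample update is a dense array operation over all neurons, which is the kind of computation a tensor backend can move to a GPU.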
DB-TSDF: Directional Bitmask-based Truncated Signed Distance Fields for Efficient Volumetric Mapping
Maese, Jose E., Merino, Luis, Caballero, Fernando
This paper presents a high-efficiency, CPU-only volumetric mapping framework based on a Truncated Signed Distance Field (TSDF). A key feature of the approach is that the processing time per point-cloud remains constant, regardless of the voxel grid resolution, enabling high resolution mapping without sacrificing runtime performance. In contrast to most recent TSDF/ESDF methods that rely on GPU acceleration, our method operates entirely on CPU, achieving competitive results in speed. Experiments on real-world open datasets demonstrate that the generated maps attain accuracy on par with contemporary mapping techniques. Volumetric mapping is a fundamental capability in mobile robotics, supporting tasks such as collision avoidance, motion planning, and the construction of consistent world models under real-time constraints. Point clouds and occupancy grids remain widely used on CPU-only platforms, as their simple data structures allow efficient processing without specialized hardware. However, they are prone to aliasing at high resolutions and often produce geometric artifacts that hinder downstream processing.
- South America > Brazil > Rio de Janeiro > South Atlantic Ocean (0.04)
- Europe > Spain > Andalusia > Seville Province > Seville (0.04)
Distilled Large Language Model in Confidential Computing Environment for System-on-Chip Design
Ben, Dong, Feng, Hui, Wang, Qian
Large Language Models (LLMs) are increasingly used in circuit design tasks and have typically undergone multiple rounds of training. Both the trained models and their associated training data are considered confidential intellectual property (IP) and must be protected from exposure. Confidential Computing offers a promising solution to protect data and models through Trusted Execution Environments (TEEs). However, existing TEE implementations are not designed to support the resource-intensive nature of LLMs efficiently. In this work, we first present a comprehensive evaluation of LLMs within a TEE-enabled confidential computing environment, specifically utilizing Intel Trust Domain Extensions (TDX). We ran experiments in three environments (TEE-based, CPU-only, and CPU-GPU hybrid implementations) and evaluated their performance in terms of tokens per second. Our first observation is that distilled models, e.g., DeepSeek, surpass other models in performance due to their smaller parameter counts, making them suitable for resource-constrained devices. Also, for quantized models, such as those using 4-bit quantization (Q4) and 8-bit quantization (Q8), we observed a performance gain of up to 3x compared to FP16 models. Our findings indicate that for models with fewer parameters, such as DeepSeek-r1-1.5B, the TDX implementation outperforms the CPU version in executing computations within a secure environment. We further validate the results using a testbench designed for SoC design tasks. These validations demonstrate the potential of efficiently deploying lightweight LLMs on resource-constrained systems for semiconductor CAD applications.
- North America > United States > California > Merced County > Merced (0.04)
- Asia > Middle East > Iran > Tehran Province > Tehran (0.04)
- Semiconductors & Electronics (1.00)
- Information Technology > Security & Privacy (1.00)
EvoGP: A GPU-accelerated Framework for Tree-based Genetic Programming
Wang, Lishuang, Wu, Zhihong, Sun, Kebin, Li, Zhuozhao, Cheng, Ran
Tree-based Genetic Programming (TGP) is a key evolutionary algorithm widely used in symbolic regression, feature engineering, and scientific modeling. Its high computational demands make GPU acceleration essential for scalable and high-performance evolutionary computation. However, GPU acceleration of TGP faces three key challenges: inefficient tree encoding, highly heterogeneous genetic operations, and limited parallelism in fitness evaluation. To address these challenges, we introduce EvoGP, a comprehensive GPU-accelerated TGP framework. First, we design a tensorized encoding scheme to represent trees with different structures as tensors with the same shape, optimizing memory access and enabling efficient parallel execution. Second, we propose a unified parallel framework for genetic operations by leveraging shared computational primitives and implementing dedicated CUDA kernels for scalable performance. Third, we present a fully parallel fitness evaluation strategy for symbolic regression, exploiting both population-level and data-level parallelism to maximize GPU utilization. Moreover, we implement a comprehensive library to provide rich algorithm operators and benchmark problems. EvoGP is extensively tested on various tasks, including symbolic regression, classification, and robotics control, demonstrating its versatility and effectiveness across diverse application scenarios. Experimental results show that EvoGP achieves up to a 140.89x speedup over the state-of-the-art GPU-based TGP implementation, while maintaining or exceeding the accuracy of baseline methods. EvoGP is open-source and accessible at: https://github.com/EMI-Group/evogp.
- Asia > China > Hong Kong (0.04)
- Oceania > New Zealand (0.04)
- Asia > China > Guangdong Province > Shenzhen (0.04)
- Information Technology > Hardware (1.00)
- Information Technology > Graphics (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Evolutionary Systems (1.00)
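The tensorized encoding mentioned in the EvoGP abstract above, representing trees of different structures as tensors of one shape, can be illustrated with padded prefix-order arrays. The layout below (a node-type array and a node-value array, padded to a fixed `MAX_LEN`) is a hypothetical scheme for illustration only; the abstract does not specify EvoGP's actual encoding, and the evaluator here is a plain CPU loop rather than a CUDA kernel.

```python
import numpy as np

# Hypothetical node-type and operator codes, not taken from EvoGP.
PAD, CONST, VAR, FUNC = 0, 1, 2, 3
ADD, MUL, SUB = 0, 1, 2
MAX_LEN = 8


def encode(prefix_nodes):
    """Encode a tree given in prefix order as two fixed-length arrays."""
    types = np.zeros(MAX_LEN, dtype=np.int8)       # PAD everywhere by default
    vals = np.zeros(MAX_LEN, dtype=np.float32)
    for i, (node_type, node_val) in enumerate(prefix_nodes):
        types[i], vals[i] = node_type, node_val
    return types, vals


def evaluate(types, vals, x):
    """Evaluate one encoded tree at input x by scanning right to left."""
    stack = []
    for t, v in zip(types[::-1], vals[::-1]):
        if t == PAD:
            continue
        if t == CONST:
            stack.append(float(v))
        elif t == VAR:
            stack.append(float(x[int(v)]))
        else:
            left, right = stack.pop(), stack.pop()
            op = int(v)
            if op == ADD:
                stack.append(left + right)
            elif op == MUL:
                stack.append(left * right)
            else:
                stack.append(left - right)
    return stack.pop()


# Two trees with different structures: x0 + 2 * x1 and x0 - 1.
tree_a = encode([(FUNC, ADD), (VAR, 0), (FUNC, MUL), (CONST, 2.0), (VAR, 1)])
tree_b = encode([(FUNC, SUB), (VAR, 0), (CONST, 1.0)])

# The whole population shares one shape, so it stacks into two 2D arrays.
pop_types = np.stack([tree_a[0], tree_b[0]])
pop_vals = np.stack([tree_a[1], tree_b[1]])

x = np.array([3.0, 4.0])
results = [evaluate(t, v, x) for t, v in zip(pop_types, pop_vals)]
print(pop_types.shape, results)
```

Because every individual occupies the same number of slots, a population becomes a regular array that can be indexed by (individual, node), which is what makes uniform memory access and parallel kernels practical.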
WebLLM: A High-Performance In-Browser LLM Inference Engine
Ruan, Charlie F., Qin, Yucheng, Zhou, Xun, Lai, Ruihang, Jin, Hongyi, Dong, Yixin, Hou, Bohan, Yu, Meng-Shiun, Zhai, Yiyan, Agarwal, Sudeep, Cao, Hangrui, Feng, Siyuan, Chen, Tianqi
Advancements in large language models (LLMs) have unlocked remarkable capabilities. While deploying these models typically requires server-grade GPUs and cloud-based inference, the recent emergence of smaller open-source models and increasingly powerful consumer devices have made on-device deployment practical. The web browser as a platform for on-device deployment is universally accessible, provides a natural agentic environment, and conveniently abstracts out the different backends from diverse device vendors. To address this opportunity, we introduce WebLLM, an open-source JavaScript framework that enables high-performance LLM inference entirely within web browsers. WebLLM provides an OpenAI-style API for seamless integration into web applications, and leverages WebGPU for efficient local GPU acceleration and WebAssembly for performant CPU computation. With machine learning compilers MLC-LLM and Apache TVM, WebLLM leverages optimized WebGPU kernels, overcoming the absence of performant WebGPU kernel libraries. Evaluations show that WebLLM can retain up to 80% native performance on the same device, with room to further close the gap. WebLLM paves the way for universally accessible, privacy-preserving, personalized, and locally powered LLM applications in web browsers. The code is available at: https://github.com/mlc-ai/web-llm.
torchmSAT: A GPU-Accelerated Approximation To The Maximum Satisfiability Problem
Hosny, Abdelrahman, Reda, Sherief
The remarkable achievements of machine learning techniques in analyzing discrete structures have drawn significant attention towards their integration into combinatorial optimization algorithms. Typically, these methodologies improve existing solvers by injecting learned models within the solving loop to enhance the efficiency of the search process. In this work, we derive a single differentiable function capable of approximating solutions for the Maximum Satisfiability Problem (MaxSAT). Then, we present a novel neural network architecture to model our differentiable function, and progressively solve MaxSAT using backpropagation. This approach eliminates the need for labeled data or a neural network training phase, as the training process functions as the solving algorithm. Additionally, we leverage the computational power of GPUs to accelerate these computations. Experimental results on challenging MaxSAT instances show that our proposed methodology outperforms two existing MaxSAT solvers, and is on par with another in terms of solution cost, without necessitating any training or access to an underlying SAT solver. Given that numerous NP-hard problems can be reduced to MaxSAT, our novel technique paves the way for a new generation of solvers poised to benefit from neural network GPU acceleration.
- North America > United States > Rhode Island > Providence County > Providence (0.04)
- Europe > France > Auvergne-Rhône-Alpes > Lyon > Lyon (0.04)
- North America > United States > New York > Tompkins County > Ithaca (0.04)
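The general idea behind the torchmSAT abstract above, turning MaxSAT into minimization of a differentiable function, can be sketched with a standard product relaxation. This is not the function derived in the paper: the relaxation, the clause-matrix format (+1 for a positive literal, -1 for a negated one, 0 for absent), and the names `soft_unsat_and_grad`, `count_satisfied`, and `solve` are all assumptions made for illustration, and plain gradient descent in NumPy stands in for backpropagation on a GPU.

```python
import numpy as np


def soft_unsat_and_grad(w, C):
    """Relaxed count of unsatisfied clauses and its gradient with respect to w.

    Each variable has a soft truth value p = sigmoid(w). A clause's
    "unsatisfied" score is the product of (1 - literal value) over its literals.
    """
    p = 1.0 / (1.0 + np.exp(-w))
    # Per-literal factor: 1 - p for a positive literal, p for a negated one.
    F = np.where(C > 0, 1.0 - p, np.where(C < 0, p, 1.0))
    loss = F.prod(axis=1).sum()
    grad_p = np.zeros_like(p)
    for i in range(p.size):
        # Product of the other factors in each clause (leave-one-out).
        others = np.delete(F, i, axis=1).prod(axis=1)
        grad_p[i] = np.sum(-C[:, i] * others)
    return loss, grad_p * p * (1.0 - p)   # chain rule through the sigmoid


def count_satisfied(assignment, C):
    """Number of clauses satisfied by a boolean assignment."""
    lit_true = ((C > 0) & assignment) | ((C < 0) & ~assignment)
    return int(lit_true.any(axis=1).sum())


def solve(C, steps=500, lr=1.0, seed=0):
    """Gradient descent on the relaxation, then round to a boolean assignment."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(C.shape[1])
    for _ in range(steps):
        _, grad = soft_unsat_and_grad(w, C)
        w -= lr * grad
    return w > 0


# (x0 or x1), (not x0 or x2), (not x1 or not x2), (x0 or not x2)
C = np.array([[1, 1, 0], [-1, 0, 1], [0, -1, -1], [1, 0, -1]])
assignment = solve(C)
print(count_satisfied(assignment, C), "of", len(C), "clauses satisfied")
```

As in the abstract, no labeled data or separate training phase is involved: the optimization loop itself is the solver. Gradient descent on a relaxation like this one can stall in local minima, so the rounded assignment is an approximation, not a guaranteed optimum.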